also how to align it and what to do with the sign and leading zeros. The formatting letter for an
integer is “d”, so the following are legal directives and their meaning: 


Format      Explanation                                          Result for value 1234
“{:07d}”    An integer right aligned in a 7-character space,     "0001234"
            filled on the left with zeros.
“{:7,d}”    A right aligned integer in a 7-character space,      "  1,234"
            with a “,” every 3 digits.
“{:7x}”     A right aligned integer in hexadecimal.              "    4d2"
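
A quick way to check these directives is to apply each one to the value 1234 and look at the results. The following is a minimal sketch; the variable names are chosen here for illustration and do not come from the text.

```python
# Apply the three integer directives from the table to the value 1234.
value = 1234

zero_filled = "{:07d}".format(value)   # width 7, padded on the left with zeros
with_commas = "{:7,d}".format(value)   # width 7, a comma every 3 digits
hexadecimal = "{:7x}".format(value)    # width 7, lowercase hexadecimal digits

# repr() is used so that the leading spaces are visible.
for text in (zero_filled, with_commas, hexadecimal):
    print(repr(text))
```

Using “X” in place of “x” would give uppercase hexadecimal digits instead.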

Floating point numbers have the extra issue of the decimal place. The format character is
often “f,” but it can be “e” for exponential format or “g” for general format, meaning the system
decides whether to use “f” or “e.” Otherwise, the formatting of a floating point number is like that of
previous versions of Python and like that of C and C++:


Format       Explanation                                 Result for value 12.321
“{:.3f}”     3 digits right of the decimal               "12.321"
“{:6.3f}”    6 digits, 3 to the right of the decimal     "12.321"
“{:<5.1f}”   5 digits, 1 to the right, left adjusted     "12.3 "
“{:8e}”      8 places, exponential form                  "1.232100e+01"
“{:5g}”      5 places, system decides                    "12.321"


The next three values to be printed are floating point: the mass of the meteorite and the 
location, as latitude and longitude. Printing each of these as 7 places, 2 to the right of the 
decimal, would seem to work. Or, as a format: “{:7.2f}.” 
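
Before using this directive on the real data it can be tried on a few numbers. The three values in this sketch are made up for illustration and are not taken from met.txt.

```python
# Made-up sample values standing in for a mass, a latitude, and a longitude.
mass, lat, long = 67.8, 40.48, -89.0

# Each value occupies 7 places, with 2 digits to the right of the decimal.
line = "{:7.2f} {:7.2f} {:7.2f}".format(mass, lat, long)
print(repr(line))
```

A negative sign counts as one of the 7 places, so the columns still line up.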


The solution to the problem is now at hand. The data is read line by line, converted into a 
list, and then the fields are formatted and printed in two steps: 


infile = open("met.txt", "r")
inline = infile.readline()
print("Place            Class             Mass Latitude Longitude")
while inline != "":
    inlist = inline.split(",")
    mass = float(inlist[4])
    lat = float(inlist[7])
    long = float(inlist[8])
    # Step 1: place, class, and mass, staying on the same output line
    print('{:16s} {:14s} {:7.2f}'.format(inlist[0], inlist[3], mass), end="")
    # Step 2: latitude and longitude, which finishes the line
    print(' {:7.2f} {:7.2f}'.format(lat, long))
    inline = infile.readline()
infile.close()


The result is: 

Place Class Mass Latitude Longitude 
Bloomington 5 1368 ewe Yo, =LaOye5 
Bogou LL6 21.09 =66225 =53.06 
Alessandria L4 106,11, A538. 0 10.96 

Bo Xian fis) 85.92 Care =90 620 
Ashdon Rucrite-mmict see a, —Oeaee -178.84 
Berduc L6 Id. -oe220 LO7 4 10 


There are many more formatting directives, and a huge number of their combinations. 


ADVANCED DATA FILES


File operations were discussed in Chapter 5, but the discussion was limited to files containing
text. Text is crucial because it is how humans communicate with the computer; people are 
unhappy about having to enter binary numbers. On the other hand, text files take up more space 
than needed to hold the information they do. Each character requires at least one byte. The 
number 3.1415926535 thus takes up 12 bytes, but if stored as a floating point number it needs 
only 4 or 8 depending on precision. 
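
The difference in size can be measured directly. This sketch assumes the struct module, which is described later in this section, as the way to produce the 4-byte and 8-byte binary forms; the variable names are illustrative.

```python
import struct

text_form = "3.1415926535".encode("ascii")   # one byte per character
single = struct.pack("<f", 3.1415926535)     # single precision float
double = struct.pack("<d", 3.1415926535)     # double precision float

print(len(text_form), len(single), len(double))
```

The single precision form is the smallest, but it cannot hold all of the digits that the text form does.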


The file system on most computers also permits a variety of operations that have not been 
discussed. This includes reading from any point in a file, appending data to files, and 
modifying data. The need for processing data effectively is a main reason for computers to 
exist at all, so it is important to know as much as possible about how to program a computer 
for these purposes. 


Binary Files 


A binary file is one that does not contain text, but instead holds the raw, internal 
representation of its data. Of course, all files on a computer disk are binary in the strict sense, 
because they all contain numbers in binary form, but a binary file in this discussion does not 
contain information that can be read by a human. Binary files can be more efficient than other
kinds, both in file size (smaller) and the time it takes to read and write them (less). Many
standard file types, such as MP3, exist as binary files, so it is important to understand how to
manipulate them.


Example: Create a File of Integers


The array type holds data in a form that is more natural for most computers than a list, and 
also has the tofile() method built in. If a collection of integers is to be written as a binary file, 
a first step is to place them into an array. If a set of 10000 consecutive integers are to be 
written to a file named “ints,” the first step is to import the array class and open the output file. 
Notice that the file is open in “wb” mode, which means “write binary”: 


from array import array
out = open('ints', 'wb')

Now create an array to hold the elements and fill the array with the consecutive integers:

arr = array('i')
for k in range(10000, 20000):
    arr.append(k)

Finally, write the data in the array to the file:

arr.tofile(out)
out.close()

This file has a size listed as 40kb on a Windows PC. A file having the same integers written
as text is 49kb. This is not exactly a huge saving of space, but it does add up.

Reading these values back is just as simple:

inf = open('ints', 'rb')
arrin = array('i')
for k in range(0, 10001):
    try:
        arrin.fromfile(inf, 1)     # Read one integer and append it to arrin
    except EOFError:               # No more data on the file
        break
    print(arrin[k])
inf.close()

The try is used to catch an end of file error in cases where the number of items on the file is
not known in advance. Or just because always doing so is a good idea.


Sometimes a binary file will contain data that is all of the same type, but that situation is not 
very common. It is more likely that the file will have strings, integers, and floats intermixed. 
Imagine a file of data for bank accounts or magazine subscriptions; the information included 
will be names and addresses, dates, financial values, and optional data, depending on the 
specific situation. Some customers have multiple accounts, for example. How can binary files 
be created that contain more than one kind of information? By using structs. 


Parker, James R.. Python. an Introduction to Programming, Mercury Learning & Information, 2021. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/dogus-ebooks/detail.action?docID=4895083. 
Created from dogus-ebooks on 2023-09-12 18:45:34. 


Copyright © 2021. Mercury Learning & Information. All rights reserved. 


The Struct Module


The struct module permits variables and objects of various types to be converted into what 
amounts to a sequence of bytes. It is a common claim that this is in order to convert between 
Python forms and C forms, because C has a struct type (short for structure). However, many 
files exist that consist of mixed-type data in raw (i.e., machine compatible) form that have been 
created by many programs in many languages. It is possible that C is singled out because the 
name struct was used. 


Example: A Video Game High Score File 


Video game players need little incentive to try hard to win a game, but for many years a 
special reward has been given to the better players. The game 
“remembers” the best players and lists them at the beginning and end of the game. This kind of 
ego boost is a part of the reward system of the game. The game program stores the information 
on a file in descending order of score. The data that is saved is usually the player’s name or 
initials, the score, and the date. This mixes string with numeric data. 


Consider that the player’s name is held in a variable name, the score is an integer score, 
and the date is a set of three strings year, month, and day. In this situation the size of each 
value needs to be fixed, so allow 32 characters for the name, 4 for year, 2 for month, and 2 for 
day. The file was created with the name first, then the score, then the year, month, and day. The 
order matters because it will be read in the same order that it was written. On the file the data 
will look like this: 


cccccccccccccccccccccccccccccccc iiii  cccc cc    cc
Player’s name                    Score Year Month Day


Each letter in the first string represents a byte in the data for this entry. The ‘c’s represent 
characters; the ‘i’s represent bytes that are part of an integer. There are 44 bytes in all, which 
is the size of one data record, which is what one set of related data is generally called. A file 
contains the records for all of the elements in the data set, and in this case a record is the data 
for one player, or at least one time that the player played the game. There can be multiple 
entries for a player. 


One way to convert mixed data like this into a struct is to use the pack() method. It takes a 
format parameter first, which indicates what the struct will consist of in terms of bytes. Then 
the values are passed that will be converted into components of the final struct. For the 
example here the call to pack() would be: 


s = pack ("32si4s2s2s", name, score, year, month, day) 


The format string is “32si4s2s2s”; there are 5 parts to this, one for each of the values to be 
packed: 


32s is a 32-character long string. It should be of type bytes.
i is one integer. However, 2i would be two integers, and 12i is 12 integers.
4s is a 4-character long string.
2s is a 2-character long string.

Other important format items are:

c is a character
f is a float
d is a double precision float

The value returned from pack() has type bytes, and in this case is 44 bytes long. The high
score file consists of many of these records, all of which are the same size. A record can be
written to a file using write(). So, a program that writes just one such record would be:


from struct import *

f = open("hiscores", "wb")
name = bytes("Jim Parker", 'UTF-8')
score = 109800
year = b"2015"
month = b"12"
day = b"26"
s = pack("32si4s2s2s", name, score, year, month, day)
f.write(s)
f.close()
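
The claim that each record occupies 44 bytes can be checked with calcsize(), another function in the struct module. The sketch below packs the same sample record in memory, without writing a file, and shows that a short name is padded with zero bytes to fill its 32-byte field. The name record_format is introduced here for illustration.

```python
from struct import pack, calcsize

record_format = "32si4s2s2s"
print(calcsize(record_format))     # bytes needed for one record

s = pack(record_format, b"Jim Parker", 109800, b"2015", b"12", b"26")
print(type(s), len(s))

# The 10-character name is followed by 22 zero bytes of padding.
name_field = s[:32]
print(name_field)
```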


Reading this file involves first reading the string of bytes that represents a data record.
Then it is unpacked, which is the reverse of what pack() does. The unpack() function takes a
format string as the first parameter, the same kind of format string as pack() uses, and the
bytes to be decoded as the second. It returns a tuple holding the unpacked values. An example
that reads the record in the above code would be:

from struct import *

f = open("hiscores", "rb")
s = f.read(44)
name, score, year, month, day = unpack("32si4s2s2s", s)
name = name.decode("UTF-8")
year = year.decode("UTF-8")
month = month.decode("UTF-8")
day = day.decode("UTF-8")
f.close()


The data returned by unpack are bytes, and need to be converted into strings before being
used in most cases. Note the input mode on the open() call is “rb,” read binary.
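
One detail is easy to miss: a name shorter than 32 characters comes back padded with zero bytes, and those survive the call to decode(). The following is a minimal sketch of one way to remove them, assuming the record layout used above. The method rstrip() is a standard method of bytes objects; the variable names are illustrative.

```python
from struct import pack, unpack

s = pack("32si4s2s2s", b"Jim Parker", 109800, b"2015", b"12", b"26")
name, score, year, month, day = unpack("32si4s2s2s", s)

raw_name = name.decode("UTF-8")                      # still 32 characters long
clean_name = name.rstrip(b"\x00").decode("UTF-8")    # padding removed first
print(len(raw_name), repr(clean_name), score)
```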


A file in this format has been provided, and is named simply ‘hiscores.’ When a player plays
the game they will enter their name; the computer knows their score and the date. A new entry
must be made in the ‘hiscores’ file with this new score in it. How is that done?


Start with the new player data for Karl Holter, with a score of 100000. To update the file it 
is opened and records are read and written to a new temporary file (named “tmp”) until one is 
found that has a smaller score than the 100000 that Karl achieved. Then Karl’s record is 
written to the temporary file, and the remainder of ‘hiscores’ is copied there. This creates a 
new file named “tmp” that has Karl’s data added to it, and in the correct place. Now that file 
can be copied to “hiscores” replacing the old file, or the file named “tmp” can be renamed as 
“hiscores.” This is called a sequential file update. 


Renaming the file requires access to some of the operating system functions in the module
os; in particular:

import os
os.rename("tmp", "hiscores")
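
The sequential update described above can be sketched as follows. This is an illustration under stated assumptions, not the book's program: the helper names make_record(), insert_score(), and read_scores() are hypothetical, the two existing players are made up, and the files live in a scratch directory. It also uses os.replace() rather than os.rename(), because os.rename() fails on Windows when the destination file already exists.

```python
import os
import tempfile
from struct import pack, unpack

FORMAT = "32si4s2s2s"   # record layout used in this section
SIZE = 44               # bytes per record

def make_record(name, score):
    # Hypothetical helper; the date is fixed to keep the sketch short.
    return pack(FORMAT, name, score, b"2015", b"12", b"26")

def insert_score(path, tmp_path, new_record, new_score):
    """Copy path to tmp_path, placing new_record ahead of the first smaller score."""
    placed = False
    with open(path, "rb") as old, open(tmp_path, "wb") as new:
        while True:
            s = old.read(SIZE)
            if len(s) < SIZE:              # end of file
                break
            if not placed and unpack(FORMAT, s)[1] < new_score:
                new.write(new_record)      # the new entry goes just before this one
                placed = True
            new.write(s)
        if not placed:                     # smallest score so far: goes at the end
            new.write(new_record)
    os.replace(tmp_path, path)             # the temporary file becomes the score file

def read_scores(path):
    with open(path, "rb") as f:
        data = f.read()
    return [unpack(FORMAT, data[i:i + SIZE])[1] for i in range(0, len(data), SIZE)]

# Demonstration in a scratch directory with made-up players.
folder = tempfile.mkdtemp()
hiscores = os.path.join(folder, "hiscores")
with open(hiscores, "wb") as f:
    f.write(make_record(b"Player A", 109800))
    f.write(make_record(b"Player B", 99000))

insert_score(hiscores, os.path.join(folder, "tmp"),
             make_record(b"Karl Holter", 100000), 100000)
scores = read_scores(hiscores)
print(scores)
```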


Random Access


It seems natural to begin reading a file from the beginning, but that is not always necessary. 
If the data that is desired is located at a known place in the file, then the location being read 
from can be set to that point. This is a natural consequence of the fact that disk devices can be 
positioned at any location at any time. Why not files too? 

The function that positions the file at a specific byte location is seek():

f.seek(44)       # Position the file at byte 44, which is the
                 # second record in the hiscores file.

It’s also possible to position the file relative to the current location:

f.seek(44, 1)    # Position the file 44 bytes from this location,
                 # which skips over the next record in hiscores.


A file can be rewound so that it can be read over again by calling f.seek(0), positioning the 
file at the beginning. It is otherwise difficult to make use of this feature unless the records on 
the file are of a fixed size, as they are in the file ‘hiscores,’ or the information on record sizes 
is saved in the file. Some files are intended from the outset to be used as random access files. 
Those files have an index that allows specific records to be read on demand. This is very much 
like a dictionary, but on a file. Assuming that the score for player Arlen Franks is needed, the 
name is searched for in the index. The result is the byte offset for Arlen’s high score entry in 
the file. 
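
The idea of an index can be sketched with a dictionary that maps a name to a byte offset. This is only an illustration: the text does not say how the index is stored, so the sketch assumes it is built in memory by reading the file once. The function names build_index() and score_at() and the sample players are hypothetical.

```python
import os
import tempfile
from struct import pack, unpack

FORMAT = "32si4s2s2s"   # record layout used in this section
SIZE = 44               # bytes per record

def build_index(path):
    """Map each player's name to the byte offset of that player's record."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        while True:
            s = f.read(SIZE)
            if len(s) < SIZE:              # end of file
                break
            name = unpack(FORMAT, s)[0].rstrip(b"\x00").decode("UTF-8")
            index[name] = offset
            offset += SIZE
    return index

def score_at(path, offset):
    """Read only the record that starts at the given byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return unpack(FORMAT, f.read(SIZE))[1]

# Demonstration with a small made-up file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "hiscores")
with open(path, "wb") as f:
    for name, score in ((b"Player A", 500), (b"Player B", 400), (b"Arlen Franks", 300)):
        f.write(pack(FORMAT, name, score, b"2015", b"12", b"26"))

index = build_index(path)
arlen_score = score_at(path, index["Arlen Franks"])
print(index["Arlen Franks"], arlen_score)
```

Once the index exists, finding a score costs one seek() and one read(), no matter how long the file is.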
Arlen’s record starts at byte 352 (record 8, counting from 0, times 44 bytes per record). He
just played the game again and improved his score. Why not update his record on the file? The
file needs to be open for input and output, so mode “rb+,” meaning open a binary file for input
and output, would work in this case. Then position the file to Arlen’s record, create a new
record, and write that one record. This is new. Being able to both read and write the same file
seems odd, but if the data being written is exactly the same size as the record on the file then
no harm should come from it. The program is:

# Update one record in the hiscores file

from struct import *

f = open("hiscores", "r+b")    # Open binary file, input and output
pos = 44*8                     # Desired record is 8, 44 bytes per record
f.seek(pos)                    # Seek to that position on the file
s = f.read(44)                 # Read the target record
name = b'Arlen Franks'         # Make a new one with a new score
score = 100300
year = b'2015'
month = b'12'
day = b'26'
ss = pack("32si4s2s2s", name, score, year, month, day)    # Pack the new data
f.seek(44*8)                   # Seek the original position again!
f.write(ss)                    # Write the new data over the old
f.close()                      # Close the file


This works fine, provided that the position of Arlen’s data in the file is known. It does not 
maintain the file in descending order, though. 


Example: Maintaining the High Score File in Order 


The circumstances of the new problem are that a player only appears in the high score file 
once and the file is maintained in descending order of score. If a player improves their score, 
then their entry should move closer to the beginning of the file. This is a more difficult problem
than before, but one that is still practical. So, presume that a player has achieved a new score.
The entire process should be: 


Get the player’s old score.         Read the file, get the player’s record, unpack it.

Is the new score larger?            If not, close the file. Done.

Yes, so find out where the score    Look at successively preceding records until one is
belongs, in the file.               found that has a larger score.

Place the new record where it       Copy the records from the new position for the record
belongs.                            ahead one position until the old position is reached.


The process is like moving a playing card closer to the top of the deck while leaving the 
other cards in the same order. It’s probably more efficient to move the record while searching 
for the correct position, though. Each time the previous record is examined, if it does not have 
a larger score then the record being placed is copied ahead one position. This results in a 
pretty compact program, given the nature of the problem, but it is a bit tricky to get right. For 
example, what if the new score is the highest? What if the current high score gets a higher 
score? (See: Exercise 11) 
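
One possible shape for such a program is sketched below. It is an illustration under stated assumptions, not the intended solution to the exercise: the position of the player's record is assumed to be known already, the old date is kept, and the names make_record() and improve_score() and the sample players are hypothetical.

```python
import os
import tempfile
from struct import pack, unpack

FORMAT = "32si4s2s2s"   # record layout used in this section
SIZE = 44               # bytes per record

def make_record(name, score):
    # Hypothetical helper; the date is fixed to keep the sketch short.
    return pack(FORMAT, name, score, b"2015", b"12", b"26")

def improve_score(path, position, new_score):
    """Give the record at index position a new score and move it toward the
    front of the file so that descending order is kept."""
    with open(path, "r+b") as f:
        f.seek(position * SIZE)
        name, old_score, year, month, day = unpack(FORMAT, f.read(SIZE))
        if new_score <= old_score:
            return position                      # not an improvement: done
        new_record = pack(FORMAT, name, new_score, year, month, day)
        while position > 0:
            f.seek((position - 1) * SIZE)
            previous = f.read(SIZE)
            if unpack(FORMAT, previous)[1] >= new_score:
                break                            # found a larger score
            f.seek(position * SIZE)
            f.write(previous)                    # copy it ahead one position
            position -= 1
        f.seek(position * SIZE)
        f.write(new_record)                      # the gap left behind is filled
        return position

# Demonstration with made-up players in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "hiscores")
with open(path, "wb") as f:
    for name, score in ((b"Player A", 500), (b"Player B", 400),
                        (b"Player C", 300), (b"Player D", 200)):
        f.write(make_record(name, score))

new_position = improve_score(path, 3, 450)       # Player D now scores 450
with open(path, "rb") as f:
    data = f.read()
records = [unpack(FORMAT, data[i:i + SIZE]) for i in range(0, len(data), SIZE)]
scores = [r[1] for r in records]
print(new_position, scores)
```

Both special cases raised above fall out of the loop condition: a new highest score moves all the way to position 0, and a record already at position 0 is simply rewritten in place.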


STANDARD FILE TYPES 


Everyone’s computer has files on it that the owner did not create. Some have been 
downloaded; some merely came with the machine. It is common practice to associate specific 
kinds of files, as indicated initially by some letters at the end of the file name, with certain 
applications. A file that ends in “.doc,” for example, is usually a file created by Microsoft 
Word, and a file ending in “.mp3” is usually a sound file, often music. Such files have a format 
that is understood by existing software packages, and some of them (“.gif”) have been around
for thirty years.


Each file type has been designed to make certain operations easy, and to pass certain 
information to the application. Over the years a set of de facto standards has evolved for how
these files are laid out, and for what data are provided for what kinds of file. And yet most 
users and many programmers do not understand how these files are structured or why. Many 
users do not care, of course, and some programmers too, but opening up these files to some 
scrutiny is an educational experience. 


Image Files 


Images have been processed using computers since the 1960s when NASA started 
processing images at the Jet Propulsion Laboratory. After some years people (scientists,
mainly) decided that having standards for computer images would be useful. The first formats
were ad hoc, and based essentially on raw pixel data. Raw data means knowing what the
image size is in advance, so headers were introduced providing at least that information,
leading to the TARGA format (.tga) and TIFF (Tagged Image File Format) in the mid-1980s.
When the Internet and the World Wide Web became popular, the GIF was invented, which 
compressed the image data. This was followed by JPEG and other formats that could be used 
by web designers and rendered by browsers, and each had a specific advantage. After all, 
reducing size meant reducing the time it took to download an image. 


Once a file format has been around for a few years and has become successful it tends to 
stick around, so many of the image file formats created in the 1980s are still here in one form 
or another. There are new ones too, like PNG (Portable Network Graphics), which have been 
specifically designed for the Internet. Older ones (like JPEG) have found common uses in new 
technologies, like digital cameras. A programmer/computer scientist needs to know about the 
nature of the various formats, their pros and cons as it were. 


GIF


The Graphics Interchange Format is interesting from many perspectives. First, it uses 
compression to reduce the size of the file, but the compression method is not lossy, meaning 
that the image does not change after being compressed and then decompressed. The 
compression algorithm used is called LZW, and will be discussed in Chapter 10. GIF uses a 
color map representation, so an element in the image is not a color, but instead is an index into 
an array that holds the color. That is, if v = image[row][column] then the color of that pixel is 
(red[v], green[v], blue[v]). The color itself could be a full 24 bits, but the value v is a byte, 
and so in a GIF there can only be 256 distinct colors. GIF uses a little-endian representation, 
meaning that the least significant byte of multi-byte objects comes first on the file. 
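
Both ideas, the color map and the little-endian byte order, are small enough to try by hand. In this sketch the three-entry color map, the 2 by 2 image, and the two sample bytes are all made up for illustration.

```python
# A hypothetical three-entry color map: index 0 is black, 1 is red, 2 is white.
red   = [0, 255, 255]
green = [0,   0, 255]
blue  = [0,   0, 255]

# A 2 x 2 image whose elements are indices into the map, not colors.
image = [[0, 1],
         [2, 1]]

row, column = 1, 0
v = image[row][column]
color = (red[v], green[v], blue[v])
print(color)

# Little-endian: the least significant byte comes first on the file.
width = int.from_bytes(b"\x0a\x01", "little")    # 0x0a + 0x01*256
print(width)
```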


One advantage of the GIF is that one of the colors can be made transparent. This means that 
when this color is drawn over another, the color below shows through. It is essentially a “do 
not draw this pixel” value. It is important for things like sprites in computer games. Another 
advantage of GIF is that multiple images can be stored in a single file, allowing an animation
to be saved in a single file. GIF animations have been common on the Internet for many years,
and while they usually represent small, brief animations such as Christmas trees with flashing 
lights, they can be as long and complex as television programs. Still, the fact that there can only 
be 256 different colors can be a problem. 


A GIF is a binary file, but the first six characters are a header block containing what is
called a magic number, or an identifying label. For a GIF file the first three characters are always
“GIF” and the next three represent the version; for the 1989 standard the first six characters are
“GIF89a.” Magic numbers are common in binary files, and are used to identify the file type. 
The file name suffix does not always tell the truth. 
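
Checking a magic number takes only a few lines. The function name looks_like_gif() in this sketch is hypothetical, and the sample byte strings are hand-made. Besides “GIF89a” the sketch also accepts “GIF87a,” the label of the earlier 1987 version of the format.

```python
def looks_like_gif(first_bytes):
    """Return True if the data begins with a GIF label and version."""
    return first_bytes[:6] in (b"GIF87a", b"GIF89a")

gif_start = b"GIF89a\x01\x00\x01\x00"    # hand-made start of a GIF
png_start = b"\x89PNG\r\n\x1a\n"         # start of a different kind of file

print(looks_like_gif(gif_start))
print(looks_like_gif(png_start))
```

A program would normally read the first six bytes of a file opened in “rb” mode and pass them to a check of this kind before trusting the file name suffix.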


Following the header is the logical screen descriptor, which explains how much screen
space the image requires. This is seven bytes:

Canvas width              2 bytes
Canvas height             2 bytes
Packed byte               1 byte    A set of flags and small values:
    Bit 8                 Global color table?
    Bits 7 6 5            Color resolution
    Bit 4                 Sort flag
    Bits 3 2 1            Size of global color table
Background color index    1 byte
Pixel aspect ratio        1 byte

This is followed by the global color table, other descriptors, and the image data. The details
can be found in manuals and online. The information in the first few bytes is critical, though,
and the knowledge that LZW compression is used means that the pixels are not immediately
available. Decompression is done to the image as a whole.

from struct import *

f = open("test.gif", "rb")
s = f.read(13)                 # Read the header and screen descriptor
# '<' means little-endian with no padding: 6s H H B B B
id, wd, ht, flags, bci, par = unpack('<6sHHBBB', s)    # Width comes before height
f.close()
id = id.decode("utf-8")
print(id)
print("Height", ht, "Width", wd)
print("Flags:", flags)
print("Background color index: ", bci)
print("Pixel aspect ratio:", par)
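
The packed byte is printed above as a single number, but its fields can be separated with shifts and masks. The sketch below works on a hand-made 13-byte header for a hypothetical 10 by 20 image, so no file is needed. It relies on two conventions of the GIF format: the color resolution is stored as one less than the number of bits, and the table size field n means a table of 2 to the power n+1 entries.

```python
from struct import unpack

# Hand-made header: "GIF89a", width 10, height 20, packed byte 0xF7,
# background color index 0, pixel aspect ratio 0.
header = b"GIF89a" + bytes([10, 0, 20, 0, 0xF7, 0, 0])
sig, wd, ht, flags, bci, par = unpack("<6sHHBBB", header)

has_global_table = (flags & 0x80) != 0           # bit 8
color_resolution = ((flags >> 4) & 0x07) + 1     # bits 7-5, stored minus one
sorted_table = (flags & 0x08) != 0               # bit 4
table_entries = 2 ** ((flags & 0x07) + 1)        # bits 3-1

print(wd, ht, has_global_table, color_resolution, sorted_table, table_entries)
```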


JPEG


A JPEG image uses a lossy compression scheme, and so the image is not the same after
compression as it was before compression. For this reason it should never be used for
scientific or forensic purposes when measurements will be made using the image. It should
never be used for astronomy, for example, although it is perfectly fine for portraits and
landscape photographs.


The name JPEG is an acronym for the Joint Photographic Experts Group, and actually refers 
to the nature of the compression algorithm. The file format is an envelope that contains the 
image, and is referred to as JFIF (JPEG File Interchange Format). The file header contains 20
bytes: the magic number is the first 4 and bytes 6-10. The first 4 bytes are hex FF, D8, FF, and
E0. Bytes 6-10 should be “JFIF\0,” and this is followed by a revision number. A short
program that decodes the header is:

from struct import *

f = open("test.jpg", "rb")
s = f.read(20)                 # Read the header
# '>' means big-endian, the byte order JFIF uses: B B B B H 5s B B B H H B B
b1, b2, a1, a2, sz, id, v1, v2, unit, xd, yd, xt, yt = unpack('>BBBBH5sBBBHHBB', s)
f.close()
id = id.decode("utf-8")
print(id, "revision", v1, v2)
if b1 == 0xff and b2 == 0xd8:
    print("SOI checks.")
else:
    print("SOI fails.")
if a1 == 0xff and a2 == 0xe0:
    print("Application marker checks.")
else:
    print("Application marker fails.")
print("App 0 segment is", sz, "bytes long.")
if unit == 0:
    print("No units given.")
elif unit == 1:
    print("Units are dots per inch.")
elif unit == 2:
    print("Units are dots per centimeter.")
if unit == 0:
    print("Aspect ratio is ", xd, ":", yd)
else:
    print("Xdensity: ", xd, " Ydensity: ", yd)
if xt == 0 and yt == 0:
    print("No thumbnail")
else:
    print("Thumbnail image is ", xt, "x", yt)


The compression scheme used in JPEG is very involved, but it does cause certain
identifiable artifacts in an image. In particular, pixels near edges and boundaries are smeared,
essentially averaging values across small regions (Figure 8.1). This can cause problems if a
JPEG image is to be edited, for example in Photoshop or Paint.


Figure 8.1 
JPEG images tend to show artifacts at places where pixels change rapidly, like corners and edges. 


TIFF


The Tagged Image File Format has a potentially huge amount of metadata associated with
it, and that is all in text form in the file. It’s a favorite among scientists because of that: the
device used to capture the image, the focal length of the lens, time, subject, and scores of other
items of information can accompany the image. In fact, the TIFF has been adopted for use with
numeric non-image data as well. The other reason it is popular is that it can be used with
uncompressed (raw) data.


The word Tagged comes from the fact that information is stored in the file using tags, such as 
might be found in an HTML file—except that the tags in a TIFF are not in text form. A tag has 
four components: an ID (2 bytes, what tag is this?), a data type (2 bytes, what type are the items
in this tag?), a data count (4 bytes, how many items?), and a byte offset (4 bytes, where are
these items?). Tags are
identified by number, and each tag has a specific meaning. Tag 257 means Image Height and 
256 is Image Width; 315 is the code meaning Artist, 306 means Date/Time, and 270 is the 
Image Description. They can be in any order. In fact, the whole file structure is flexible 
because all components are referenced using a byte offset into the file. 
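
A single tag can be taken apart with one call to unpack(). The sketch below uses a hand-made 12-byte tag in little-endian order rather than a real file; the dictionary TAG_NAMES holds only the five tag numbers mentioned above, and the field values 3, 1, and 640 are made up.

```python
from struct import pack, unpack

# Only the tag numbers mentioned in the text; a real TIFF defines many more.
TAG_NAMES = {256: "Image Width", 257: "Image Height", 270: "Image Description",
             306: "Date/Time", 315: "Artist"}

# A hand-made tag: ID 256, data type 3, one item, offset field 640.
raw = pack("<HHII", 256, 3, 1, 640)

# 2-byte ID, 2-byte data type, 4-byte count, 4-byte offset: 12 bytes in all.
tag_id, data_type, count, offset = unpack("<HHII", raw)
tag_name = TAG_NAMES.get(tag_id, "unknown")
print(tag_name, data_type, count, offset)
```

For a file that begins with “MM” the format string would start with “>” instead of “<”.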


A TIFF begins with an 8-byte Image File Header (IFH):

Byte order: This is 2 bytes, and is “II” if data is in little-endian form and “MM” if it is
big-endian.

Version Number: 2 bytes. Always 42.

First Image File Directory offset: 4 bytes, the offset in the file of the first image.

The other important part of a TIFF is the Image File Directory (IFD), which contains 
information about the specific image, including the descriptive tags and data. The IFH is 
always 8 bytes long and is at the beginning of the file. An IFD can be almost any size and can 
be anywhere in the file; there can be more than one, as well. The first IFD is found by 
positioning the file to the offset found in the IFH. Subsequent ones are indicated in the IFD. The
IFD structure is:


Number of tags: 2 bytes

Tags: Array of tags, size unknown

Next IFD offset: 4 bytes. File offset of the next IFD. If there are no more, then = 0.


The structure of a tag was given previously, so a TIFF is now defined. The image data can be,
and frequently is, raw pixels, but can also be compressed in many ways as defined by the tags. 
The program below reads the IFH and the first IFD, dumping the information to the screen: 

# TIFF

from struct import *

f = open("test.tif", "rb")
s = f.read(8)                  # Read the IFH
# '<' assumes a little-endian ("II") file: 2s H I
id, ver, off = unpack('<2sHI', s)
id = id.decode("utf-8")
print("TIFF ID is ", id, end="")
if id == "II":
    print(" which means little-endian.")
elif id == "MM":
    print(" which means big-endian")
else:
    print(" which means this is not a TIFF.")
print("Version", ver)
print("Offset", off)

f.seek(off)                    # Get the first IFD
n = 0
b = f.read(2)                  # Number of tags
n = b[0] + b[1]*256
for i in range(0, n):
    s = f.read(12)             # Read a tag
    id, dt, dc, do = unpack("<HHII", s)
    print("Tag ", id, "type", dt, "count", dc, "Offset", do)
f.close()


When this program executes using “test.tif” as the input file, the first two tags in the IFD are 
256 and 257 (width and height) which are correct. 


PNG


A PNG (Portable Network Graphics) file consists of a magic number, which in this context 
is called a signature and consists of 8 bytes, and a collection of chunks, which resemble TIFF 
tags. There are 18 different kinds of chunk, the first of which is an image header. The Signature 
is always: 137 80 78 71 13 10 26 10. The bytes 80 78 71 are the letters “PNG.” 

A chunk has either 3 or 4 fields: a length field, a chunk type, an optional chunk data field, 
and a check code based on all previous bytes in the chunk that is used to detect errors (called a 
cyclic redundancy check, or CRC). 
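
The check code can be tried out with the crc32() function in the standard zlib module, which computes the same kind of CRC that PNG uses. In a PNG file the CRC covers the chunk type and the chunk data, but not the length field. The helper names make_chunk() and check_chunk() in this sketch are hypothetical, and the chunk is hand-made rather than read from a file.

```python
import zlib
from struct import pack, unpack

def make_chunk(chunk_type, data):
    """Build a chunk: length, type, data, then the CRC of type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return pack(">I", len(data)) + chunk_type + data + pack(">I", crc)

def check_chunk(chunk):
    """Return True if the CRC stored in the chunk matches its contents."""
    length, chunk_type = unpack(">I4s", chunk[:8])
    data = chunk[8:8 + length]
    (stored,) = unpack(">I", chunk[8 + length:12 + length])
    return stored == (zlib.crc32(chunk_type + data) & 0xFFFFFFFF)

# A hand-made header chunk for a hypothetical 4 x 3 RGB image, 8 bits per sample.
ihdr = make_chunk(b"IHDR", pack(">II5B", 4, 3, 8, 2, 0, 0, 0))
damaged = ihdr[:10] + b"\xff" + ihdr[11:]    # corrupt one byte of the data

print(len(ihdr), check_chunk(ihdr), check_chunk(damaged))
```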


The image header chunk (IHDR) has the following structure:


Image width:          4 bytes

Image height:         4 bytes

Bit depth:            1 byte. Number of bits per sample (1, 2, 4, 8, or 16).

Color type:           1 byte. 0 (grey), 2 (RGB), 3 (color map), 4 (greyscale with
                      transparency), or 6 (RGB with transparency)

Compression method:   1 byte. Always 0.

Filter method:        1 byte. Always 0.

Interlace method:     1 byte. 0=no interlace. 1=Adam7 interlace (See: references)


This file has compression, but it is non-lossy. It also, like GIF, allows transparency, but 
allows full RGB color. It does not have an option for animations, though. Reading the signature
and the first (IHDR) chunk is done in the following way:

# PNG

from struct import *

b2 = (137, 80, 78, 71, 13, 10, 26, 10)     # Correct header
types = ("Grey", "", "RGB", "Color map",
         "Grey with alpha", "", "RGBA")    # Color types
f = open("test.png", "rb")
s = f.read(8)                              # Read the header
b1 = unpack('8B', s)
if b1 == b2:
    print("Header OK")
else:
    print("Bad header")

s = f.read(8)                        # The next chunk must be the IHDR
length, type = unpack(">I4s", s)     # Unpack the first 8 bytes
print("First chunk: Length is", length, "Type:", type)

s = f.read(length)                   # We know the length, read the chunk
wd, ht, dep, ctype, compress, filter, interlace = unpack(">ii5B", s)
print("PNG Image width=", wd, "Height=", ht)
print("Image has ", dep, "bits per sample.")
print("Color type is ", types[ctype])
if compress == 0:
    print("Compression OK")
else:
    print("Compression should be 0 but is", compress)
if filter == 0:
    print("Filter is OK")
else:
    print("Filter should be 0 but is", filter)
if interlace == 0:
    print("No interlace")
elif interlace == 1:
    print("Adam7 interlace")
else:
    print("Bad interlace specified: ", interlace)
f.close()


Sound Files 


A sound file can be a lot more complex than an image file, and substantially larger. To 
properly play back a sound, it is critical to know how it was sampled: how many bits per 
sample, how many channels, how many samples per second, compression schemes, and so on. 
The file must be readable in real time or the sound can’t be played without a separate decoding 
step. All that is really needed to display an image is its size, pixel format, and compression.


There are, once again, many existing audio file formats. MP3 is quite complex, too much so 
to discuss here. The usual option on a PC would be “.wav” and, as it happens, that format is 
not especially complicated. 


WAV 


A WAV file has three parts: the initial header, used to identify the file type; the format sub- 
chunk, which specifies the parameters of the sound file; and the data sub-chunk, which holds 
the sound data. 


The initial header should contain the string “RIFF” followed by the size of the file minus 8 
bytes (i.e., the size from this point forward), and the string “WAVE.” This is 12 bytes in size. 


The next “sub-chunk” has the following form: 


ID:               "fmt " (the letters fmt followed by a space)
Size1:            Size of the rest of the sub-chunk
Format:           1 if PCM, another number if compressed
No. of Channels:  mono=1, stereo=2, etc.
Sample rate:      Sound samples per second. CD rate is 44100
Byte rate:        Should be No. of channels*sample rate*bits per sample/8
Alignment:        Should be No. of channels*bits per sample/8
Bits per sample:  AKA quantization. Bits in each sample: 8 and 16 are usual.
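
The arithmetic behind these fields can be sketched in a few lines. In a WAV file the format sub-chunk stores a byte rate (bytes of sound per second) between the sample rate and the alignment, and both values follow from the other fields. The parameters below are hypothetical CD-quality settings, and the sub-chunk is built in memory instead of being read from a file.

```python
from struct import pack, unpack

# Hypothetical CD-quality parameters: PCM, stereo, 44100 Hz, 16 bits.
nchan, rate, bps = 2, 44100, 16
byte_rate = rate * nchan * bps // 8      # bytes of sound per second
align = nchan * bps // 8                 # bytes for one sample of every channel

# 4s i h h i i h h: ID, size, format, channels, rate, byte rate, align, bits
chunk = pack("<4sihhiihh", b"fmt ", 16, 1, nchan, rate, byte_rate, align, bps)
fields = unpack("<4sihhiihh", chunk)
print(len(chunk), fields)
```

The size field is 16 here because 16 bytes follow the ID and size fields in this layout.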

The final section contains the following:

ID:    "data"
Size:  Number of bytes in the data
Data:  The actual sound data, as a large block of Size bytes.


A program that reads the first two sub-chunks is:

# WAV
from struct import *

f = open ("test.wav", "rb")
s = f.read (12)                    # RIFF header: 12 bytes
riff, sz, fmt = unpack ("<4si4s", s)
riff = riff.decode ("utf-8")
fmt = fmt.decode ("utf-8")
print (riff, sz, "bytes ", fmt)

s = f.read (24)                    # Format sub-chunk: 24 bytes
id, sz1, fmt, nchan, rate, bytes, algn, bps = unpack ("<4sihhiihh", s)
#                                                      4s i h h i i h h
id = id.decode ("utf-8")
print ("ID is", id, "Channels ", nchan, "Sample rate is ", rate)
print ("Bits per sample is ", bps)
if fmt == 1:
    print ("File is PCM")
else:
    print ("File is compressed ", fmt)
print ("Byterate was ", bytes, "should be ", rate*nchan*bps/8)
f.close()


Other Files


Every type of file has a specific purpose and a format that is appropriate for that purpose.
For that reason the nature of the headers and the file contents differ, but the fact that the headers
and other specific fields exist should by now make some sense. When a program is asked to
open a file there should be some way to confirm that the contents of the file can be read by the
program. The code that has been presented so far is only sufficient to determine the file type
and some of its basic parameters. The code needed to read and display a GIF, for example,
would likely be over 1000 lines long. It is important, for someone who wishes to be a
programmer, to see how to construct a file so that it can be used effectively by others and so
that other programmers can create code that can identify that file and use it.


With that in mind, some other file types will be described briefly and considered as 
examples of how to organize data into a file. 


HTML 


An HTML (HyperText Markup Language) file is one that is recognized by a browser and 
can be displayed as a web page. It is a text file, and can be edited, saved, and redisplayed 
using simple tools; the fancy web editors are useful, but not necessary. 


The first line of text in an HTML file should be either a variation on: 


<!DOCTYPE html> 


or a variation on: 


<html> 


The problem is that these are text files, so spaces and tabs and newlines can appear without 
affecting the meaning. Browsers are also supposed to be somewhat forgiving about errors, 
displaying the page if at all possible. A simple example that shows some of the problems while 
being largely correct is: 

import webbrowser

f = open ("other.html")
html = False
while True:                          # Look at many lines
    s = f.readline()                 # Read
    if s == "":                      # End of file: stop looking
        break
    s = s.strip()                    # Remove white space (blanks, tabs)
    s = s.lower()                    # Convert to lower case for compare
    k = s.find("doctype")            # doctype found?
    if k >= 0:                       # Yes
        kk = s.find("html")          # Look also for 'html'
        if kk >= k+7:                # Found it, after DOCTYPE
            html = True              # Close enough
            break
    else:
        k = s.find("html")           # No 'doctype'. 'html'?
        if k>0 and s[k-1] == "<":    # Yes. Preceded by '<'?
            html = True              # Yes. Close enough.
            break
    if len(s) > 0:                   # Is the string non-blank?
        html = False                 # Yes. So it is probably not HTML
        break
f.close()

if html:
    webbrowser.open_new_tab('other.html')
else:
    print ("This is not an HTML file.")

This program uses the webbrowser module of Python to display the web page if it is one.
The call webbrowser.open_new_tab('other.html') opens the page in a new tab,
if the browser is open. This module is not a browser itself. It simply opens an existing
installed browser to do the work of displaying the page.
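
The same test can be tried on strings without opening a file or a browser. The following is a sketch under the assumption that only the first non-blank line matters; the function name looks_like_html is hypothetical and does not come from the webbrowser module.

```python
def looks_like_html(text):
    """Return True if the first non-blank line suggests an HTML document."""
    for line in text.splitlines():
        s = line.strip().lower()
        if s == "":
            continue                            # skip blank lines
        k = s.find("doctype")
        if k >= 0:
            return s.find("html", k + 7) >= 0   # 'html' after 'doctype'
        k = s.find("html")
        return k > 0 and s[k - 1] == "<"        # '<html' on the line
    return False                                # nothing but blank lines

print(looks_like_html("\n  <!DOCTYPE html>\n<html>"))
print(looks_like_html("<HTML lang='en'>"))
print(looks_like_html("Dear Sir,\n<html>"))
```

Like the program above, this is a rough check: it accepts anything whose first line of text looks plausible and makes no attempt to validate the rest of the page.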


EXE 


This is a Microsoft executable file. The details of the format are involved, and require a 
knowledge of computers and formats beyond a first-year level, but detecting one is relatively 
simple. The first two bytes that identify an EXE file are: 

Byte 0: 0x4D
Byte 1: 0x5A

It is always possible that the first two bytes of a file will be these two by 
accident, but it is unlikely. If the file being examined is, in fact, an EXE file, then a Python 
program can execute it. This uses the operating system interface module os: 

import os 


os.system ("program.exe") 
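
Checking the two identifying bytes takes only a few lines. This is a minimal sketch: the function name is_exe is an assumption, and the demonstration writes a temporary file with made-up contents instead of examining a real program. The bytes 0x4D and 0x5A are the ASCII letters "MZ".

```python
import os
import tempfile

def is_exe(path):
    """Return True if the file starts with the bytes 0x4D 0x5A ('MZ')."""
    with open(path, "rb") as f:
        return f.read(2) == b"MZ"

# Demonstration with a temporary file holding made-up contents
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"MZ\x90\x00 not a real program")
tmp.close()
result = is_exe(tmp.name)
print(result)
os.remove(tmp.name)
```

A test like this only says that the file claims to be an EXE. It gives no assurance that the file is safe or even able to run.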


SUMMARY


A fair definition of Computer Science would be the discipline that concerns itself with 
information. Computers can only operate on numbers, so an important aspect of using data is 
the representation of complex things as numbers. Most data consist of measurements of 
something, and as such are fundamentally numeric.

A dictionary allows a more complex indexing scheme: it is accessed by content. A
dictionary can be indexed by a string or tuple, which in general would be referred to as a key,
and the information at that location in the dictionary is said to be associated with that key.
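
As a brief reminder of what indexing by content looks like, here is a small sketch. The names and numbers are made up for illustration.

```python
# Keys may be strings or tuples; the values here are made up.
mass = {"Aachen": 21, "Abee": 107000}
place = {(50.775, 6.083): "Aachen"}

print(mass["Abee"])               # look up by string key
print(place[(50.775, 6.083)])     # look up by tuple key
print("Acapulco" in mass)         # test whether a key is present
```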


A Python array is a class that mimics the array type of other languages and offers efficiency 
in storage, exchanging that for flexibility. The struct module permits variables and objects of 
various types to be converted into what amounts to a sequence of bytes. It has a pack()
function for converting Python values into a sequence of bytes and an unpack() function for
converting such bytes back into Python values.
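
A round trip through the struct module can be sketched as follows. The format string and the values are arbitrary choices for the example; the "<" prefix asks for little-endian order with no padding between fields.

```python
from struct import pack, unpack, calcsize

fmt = "<ihd"                       # int, short, double; no padding
data = pack(fmt, 1234, 7, 12.321)  # Python values -> bytes
values = unpack(fmt, data)         # bytes -> tuple of Python values
print(len(data), calcsize(fmt), values)
```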


The string format() method allows a programmer to specify how values should be placed 
within a string. The idea is to create a string that contains the formatted output, and then print 
the string. 
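
For instance, the following sketch builds one formatted line from an integer, a floating point number, and a string, and then prints it. The values and field widths are arbitrary.

```python
# Zero-filled integer, fixed-point float, right-aligned string
line = "{:07d} {:7.2f} {:>8s}".format(1234, 12.321, "abc")
print(line)
```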

Python data can be written to files in raw, binary form. It is also possible to position the file
at any byte in a binary file, allowing the file to be read or written at any location.